We consider matching with shifts for Gibbsian sequences. We prove that the
maximal overlap behaves as $c\log n$, where $c$ is explicitly identified in
terms of the thermodynamic quantities (pressure) of the underlying potential.
Our approach is based on the analysis of the first and second moments of the
number of overlaps of a given size. We treat both the case of equal sequences
(and nonzero shifts) and independent sequences.

1. Introduction. In sequence alignment one wants to detect significant
similarities between two (e.g., genetic or protein) sequences. In order to
distinguish "significant" similarities, one has to compute the probability that
a similarity of a certain size occurs for two independent sequences. The
symbols in the sequences do not, however, necessarily occur independently.
From the point of view of statistical mechanics, it is quite natural to assume
that the symbols in the sequence are generated according to a stationary Gibbs
measure: this is the equilibrium measure which maximizes the entropy under
physical constraints such as energy conservation. A priori there is no reason
to assume that the symbols (bases) in, for example, a DNA sequence, are i.i.d.
or even Markov. It can, however, be plausible to assume that there is an
underlying Markov chain of which the symbol sequence is a reduction: in that
case we arrive at a so-called hidden Markov chain, and it is well known that
hidden Markov chains generically have infinite memory (though the symbol at a
particular location depends only exponentially weakly on symbols far away).
Therefore, proposing a Gibbs measure with exponentially decaying interaction as
a model for the sequence seems quite natural. Besides the motivation coming
from sequence alignment, in dynamical systems [4] one can also ask for the
probability of having a large "overlap" in a trajectory of length $n$, but
without specifying the location of the piece of trajectory that is repeated.
It is clear that this probability is related to the entropy, but not in such a
straightforward way as the return time. In (hyperbolic) dynamical systems, by
coding and partitioning, one again naturally arrives at Gibbs measures with
exponentially decaying interactions.

The first nontrivial problem associated with sequence alignment is the
comparison of two sequences where it is allowed to shift one sequence with
respect to the other. Note that this problem is not easy even in the case of
independent symbols in the sequence, precisely because shifts are allowed. In
the simplest case the comparison consists in finding the maximal number of
consecutive equal symbols. Given two (independent) i.i.d. sequences, it is
proved in [5] and [6] that the maximal overlap, allowing shifts, behaves for
large sequence length as $c\log n + X$, where $n$ is the length of both
sequences, $c$ is a constant depending on the distribution of the sequence,
and $X$ is a random variable with a Gumbel distribution. The fact that
$c\log n$ is the right scale can be easily understood intuitively: it
corresponds to the maximum of order $n$ weakly dependent variables. However,
even in the case of i.i.d. sequences, it is not so easy to make that intuition
rigorous, as we allow shifts. In fact, the results of [5] and [6] are based on
large deviations, together with an analysis of random walk excursions. As the
proofs use a form of permutation invariance, they cannot be extended to
non-i.i.d. cases. In [9] the maximal alignment with shift is shown for Markov
sequences, which requires a theory of excursions of random walks with
Markovian increments.

In this paper we focus on the more elementary question of showing that the
maximal overlap allowing shifts behaves as $c\log n$, but now in the context
of general Gibbsian sequences. We also allow matching a sequence with itself
(where of course we have to restrict to nonzero shifts). The constant $c$ is
explicitly identified and related to thermodynamic quantities associated to
the potential of the underlying Gibbs measure.

Our approach is based on a first and second moment analysis of the random
variable $N(\sigma,n,k)$ that counts the number of shift-matches of size $k$
in a sequence $\sigma$ of length $n$. One easily identifies the scale
$k = k_n = c\log n$ which discriminates the region where the first moment
$\mathbb{E}N(\sigma,n,k_n)$ goes to zero (as $n\to\infty$) from the region
where $\mathbb{E}N(\sigma,n,k_n)$ diverges. Via a second moment estimate, we
then prove that this scale also separates the $N(\sigma,n,k_n)\to 0$ region
from the $N(\sigma,n,k_n)\to\infty$ region (convergence in probability).

Our paper is organized as follows: in Section 2 we introduce the basic
preliminaries about Gibbs measures, in Section 3 we analyze the first moment
of $N(\sigma,n,k)$ in the case of matching a sequence with itself, and in
Section 4 we study the second moment. In Section 5 we treat the case of two
independent (Gibbsian) sequences with the same and with different marginal
distributions.

2. Definitions and preliminaries. We consider random stationary sequences [8]
$\sigma = \{\sigma(i) : i \in \mathbb{Z}\}$ on the lattice $\mathbb{Z}$, where
$\sigma(i)$ takes values in a finite set $\mathcal{A}$. The joint distribution
of $\{\sigma(i) : i \in \mathbb{Z}\}$ is denoted by $\mathbb{P}$. We treat the
case where $\mathbb{P}$ is a Gibbs measure with exponentially decaying
interaction; see Section 2.3 below for details. The configuration space
$\Omega = \mathcal{A}^{\mathbb{Z}}$ is endowed with the product topology
(making it into a compact metric space). The set of finite subsets of
$\mathbb{Z}$ is denoted by $\mathcal{S}$. For $V, W \in \mathcal{S}$, we put
$d(V,W) = \min\{|i-j| : i \in V, j \in W\}$. For $V \in \mathcal{S}$ the
diameter is defined via $\mathrm{diam}(V) = \max\{|i-j| : i,j \in V\}$. For
$V \in \mathcal{S}$, $\mathcal{F}_V$ is the sigma-field generated by
$\{\sigma(i) : i \in V\}$. For $V \in \mathcal{S}$, we put
$\Omega_V = \mathcal{A}^V$. For $\sigma \in \Omega$ and $V \in \mathcal{S}$,
$\sigma_V \in \Omega_V$ denotes the restriction of $\sigma$ to $V$. For
$i \in \mathbb{Z}$ and $\sigma \in \Omega$, $\tau_i\sigma$ denotes the
translation of $\sigma$ by $i$: $\tau_i\sigma(j) = \sigma(i+j)$. For a local
event $E \subseteq \Omega$, the dependence set of $E$ is defined as the
minimal $V \in \mathcal{S}$ such that $E$ is $\mathcal{F}_V$-measurable. We
write $\mathbf{1}$ for the indicator function.

2.1. Patterns and cylinders. For $n \in \mathbb{N}$, $n \ge 1$, let
$C_n = [1,n] \cap \mathbb{Z}$. An element $A_n \in \Omega_{C_n}$ is called an
$n$-pattern or a pattern of size $n$. For a pattern $A_n \in \Omega_{C_n}$, we
define the corresponding cylinder
$\mathcal{C}(A_n) = \{\sigma \in \Omega : \sigma_{C_n} = A_n\}$. The
collection of all $n$-cylinders is denoted by
$\mathcal{C}_n = \bigcup_{A_n \in \Omega_{C_n}} \mathcal{C}(A_n)$. Sometimes,
to denote the probability of the cylinder associated to the pattern $A_n$, we
will use the abbreviation

$$\mathbb{P}(A_n) := \mathbb{P}(\mathcal{C}(A_n)) = \mathbb{P}(\sigma_{C_n} = A_n).$$

For $A_k = (\sigma(1), \sigma(2), \ldots, \sigma(k))$ a $k$-pattern and
$1 \le i < j \le k$, we define the pattern $A_k(i,j)$ to be the pattern of
length $j-i+1$ consisting of the symbols
$(\sigma(i), \sigma(i+1), \ldots, \sigma(j))$. For two patterns $A_k$, $B_l$,
we define their concatenation $A_kB_l$ to be the pattern of length $k+l$
consisting of the $k$ symbols of $A_k$ followed by the $l$ symbols of $B_l$.
Concatenation of three or more patterns is defined analogously.

2.2. Shift-matches. We will study properties of the following basic
quantities.

Definition 2.1 (Number of shift-matches). For every configuration
$\sigma \in \Omega$, and for every $n \in \mathbb{N}$, $k \in \mathbb{N}$,
with $k < n$, we define the number of matches with shift of length $k$ up to
$n$ as

$$N(\sigma,n,k) = \frac{1}{2} \sum_{i=0}^{n-k} \sum_{j=0,\, j \ne i}^{n-k} \mathbf{1}\{(\tau_i\sigma)_{C_k} = (\tau_j\sigma)_{C_k}\} \tag{2.1}$$

$$\phantom{N(\sigma,n,k)} = \sum_{0 \le i < j \le n-k} \mathbf{1}\bigl(\sigma(i+1) = \sigma(j+1),\ \sigma(i+2) = \sigma(j+2),\ \ldots,\ \sigma(i+k) = \sigma(j+k)\bigr). \tag{2.2}$$

Definition 2.2 (Maximal shift-matching). For every configuration
$\sigma \in \Omega$ and for every $n \in \mathbb{N}$, we define $M(\sigma,n)$
to be the maximal length of a shift-matching up to $n$, that is, the maximal
$k \in \mathbb{N}$ (with $k < n$) such that there exist $i \in \mathbb{N}$ and
$j \in \mathbb{N}$ (with $0 \le i < j \le n-k$) satisfying

$$(\tau_i\sigma)_{C_k} = (\tau_j\sigma)_{C_k}, \tag{2.3}$$

where we adopt the convention $\max(\emptyset) = 0$.

Definition 2.3 (First occurrence of a shift-matching). For every
configuration $\sigma \in \Omega$ and for every $k \in \mathbb{N}$, we define
$T(\sigma,k)$ to be the first occurrence of a shift-match, that is, the
minimal $n \in \mathbb{N}$ (with $k < n$) such that there exist
$i \in \mathbb{N}$ and $j \in \mathbb{N}$ (with $0 \le i < j \le n-k$)
satisfying

$$(\tau_i\sigma)_{C_k} = (\tau_j\sigma)_{C_k}, \tag{2.4}$$

where we adopt the convention $\min(\emptyset) = \infty$.

The following proposition follows immediately from these definitions. 

Proposition 2.4. The probability distributions of the previous quantities are
related by the following "duality" relations:

$$\mathbb{P}(N(\sigma,n,k) = 0) = \mathbb{P}(M(\sigma,n) < k) = \mathbb{P}(T(\sigma,k) > n). \tag{2.5}$$
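
As an illustration only (it is not part of the paper), the quantities of Definitions 2.1 and 2.2 can be computed by brute force for a finite string, and the first equality in (2.5) can then be checked pathwise. In this sketch a Python sequence `seq` of length $n$ stands for $(\sigma(1),\ldots,\sigma(n))$, so the window at offset $i$ is `seq[i:i+k]` for $0 \le i \le n-k$; the function names are hypothetical.

```python
from collections import Counter


def count_shift_matches(seq, k):
    """N(sigma, n, k): number of pairs 0 <= i < j <= n-k whose length-k windows agree."""
    n = len(seq)
    windows = Counter(tuple(seq[i:i + k]) for i in range(n - k + 1))
    # c equal windows contribute c * (c - 1) / 2 unordered pairs (i, j)
    return sum(c * (c - 1) // 2 for c in windows.values())


def max_shift_match(seq):
    """M(sigma, n): largest k < n with at least one shift-match, or 0 if there is none."""
    best = 0
    for k in range(1, len(seq)):
        if count_shift_matches(seq, k) == 0:
            break  # a match of size k + 1 would contain a match of size k
        best = k
    return best


seq = "abcabcab"
print(max_shift_match(seq))         # 5: "abcab" occurs at offsets 0 and 3
print(count_shift_matches(seq, 2))  # 5 matching pairs of size 2
```

The early exit in `max_shift_match` relies on the monotonicity that also underlies (2.5): a shift-match of size $k+1$ restricts to one of size $k$.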

2.3. Gibbs measures. We now state our assumptions on P, and recall 
some basic facts about Gibbs measures [11]. The reader familiar with this 
can skip this section. 

We choose for $\mathbb{P}$ the unique Gibbs measure corresponding to an
exponentially decaying translation-invariant interaction. In dynamical systems
language this corresponds to the unique equilibrium measure of a Hölder
continuous potential.

2.3.1. Interactions. 

Definition 2.5. A translation-invariant interaction is a map

$$U : \mathcal{S} \times \Omega \to \mathbb{R}, \tag{2.6}$$

such that the following conditions are satisfied:

1. For all $A \in \mathcal{S}$, $\sigma \mapsto U(A,\sigma)$ is
$\mathcal{F}_A$-measurable.

2. Translation invariance:

$$U(A+i, \tau_{-i}\sigma) = U(A,\sigma) \quad \forall A \in \mathcal{S},\ i \in \mathbb{Z},\ \sigma \in \Omega. \tag{2.7}$$

3. Exponential decay: there exists $\gamma > 0$ such that

$$\|U\|_\gamma := \sum_{A \ni 0} e^{\gamma\,\mathrm{diam}(A)} \sup_{\sigma \in \Omega} |U(A,\sigma)| < \infty. \tag{2.8}$$

The set of all such interactions is denoted by $\mathcal{U}$. Here are some
standard examples of elements of $\mathcal{U}$:

1. Ising model with magnetic field $h$: $\mathcal{A} = \{-1,1\}$,
$U(\{i,i+1\},\sigma) = J\sigma_i\sigma_{i+1}$, $U(\{i\},\sigma) = h\sigma_i$
and all other $U(A,\sigma) = 0$. Here $J, h \in \mathbb{R}$. If $J < 0$, we
have the standard ferromagnetic Ising model.

2. General finite-range interactions. An interaction $U$ is called
finite-range if there exists an $R > 0$ such that $U(A,\sigma) = 0$ for all
$A \in \mathcal{S}$ with $\mathrm{diam}(A) > R$.

3. Long-range Ising models: $U(\{i,j\},\sigma) = J_{j-i}\sigma_i\sigma_j$ with
$|J_k| \le e^{-\gamma k}$ for some $\gamma > 0$ and $U(A,\sigma) = 0$ for all
other $A \in \mathcal{S}$.

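2.3.2. Hamiltonians. For $U \in \mathcal{U}$, $\zeta \in \Omega$,
$\Lambda \in \mathcal{S}$, we define the finite-volume Hamiltonian with
boundary condition $\zeta$ as

$$H_\Lambda^\zeta(\sigma) = \sum_{A \cap \Lambda \ne \emptyset} U(A, \sigma_\Lambda\zeta_{\Lambda^c}) \tag{2.9}$$

and the Hamiltonian with free boundary condition as

$$H_\Lambda(\sigma) = \sum_{A \subseteq \Lambda} U(A,\sigma), \tag{2.10}$$

which depends only on the spins inside $\Lambda$. In particular, for $A_k$ a
pattern and $\sigma \in \mathcal{C}(A_k)$, $H_{C_k}(\sigma)$ depends only on
$A_k$. We will denote, therefore,

$$H(\mathcal{C}(A_k)) = H_{C_k}(\sigma)$$

for $\sigma \in \mathcal{C}(A_k)$.

Corresponding to the Hamiltonian in (2.9), we have the finite-volume Gibbs
measures $\mathbb{P}_\Lambda^{U,\zeta}$, $\Lambda \in \mathcal{S}$, defined on
$\Omega$ by

$$\int f(\xi)\, d\mathbb{P}_\Lambda^{U,\zeta}(\xi) = \sum_{\sigma_\Lambda \in \Omega_\Lambda} f(\sigma_\Lambda\zeta_{\Lambda^c})\, \frac{e^{-H_\Lambda^\zeta(\sigma)}}{Z_\Lambda^\zeta}, \tag{2.11}$$

where $f$ is any continuous function and $Z_\Lambda^\zeta$ denotes the
partition function normalizing $\mathbb{P}_\Lambda^{U,\zeta}$ to a probability
measure:

$$Z_\Lambda^\zeta = \sum_{\sigma_\Lambda \in \Omega_\Lambda} e^{-H_\Lambda^\zeta(\sigma)}. \tag{2.12}$$
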
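2.3.3. Gibbs measures with given interaction. For a probability measure
$\mathbb{P}$ on $\Omega$, we denote by $\mathbb{P}_\Lambda^\zeta$ the
conditional probability distribution of $\sigma(i)$, $i \in \Lambda$, given
$\sigma_{\Lambda^c} = \zeta_{\Lambda^c}$. Of course, this object is only
defined on a set of $\mathbb{P}$-measure one. For $\Lambda \in \mathcal{S}$,
$\Gamma \in \mathcal{S}$ and $\Lambda \subseteq \Gamma$, we denote by
$\mathbb{P}_\Gamma(\sigma_\Lambda \mid \zeta)$ the conditional probability to
find $\sigma_\Lambda$ inside $\Lambda$, given that $\zeta$ occurs in
$\Gamma \setminus \Lambda$.

Definition 2.6. For $U \in \mathcal{U}$, we call $\mathbb{P}$ a Gibbs measure
with interaction $U$ if its conditional probabilities coincide with the ones
prescribed in (2.11), that is, if

$$\mathbb{P}_\Lambda^\zeta = \mathbb{P}_\Lambda^{U,\zeta} \quad \mathbb{P}\text{-a.s.}, \quad \Lambda \in \mathcal{S},\ \zeta \in \Omega. \tag{2.13}$$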

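In our situation, with $U \in \mathcal{U}$, the Gibbs measure $\mathbb{P}$
corresponding to $U$ is unique. Moreover, it satisfies the following strong
mixing condition: for all $V, W \in \mathcal{S}$ and all events
$A \in \mathcal{F}_V$, $B \in \mathcal{F}_W$,

$$\left| \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)\,\mathbb{P}(B)} - 1 \right| \le e^{-c\, d(V,W)}, \tag{2.14}$$

where $c > 0$ depends of course on the interaction $U$.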

2.4. Thermodynamic quantities. We now recall some definitions of basic 
important statistical mechanics quantities. 

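Definition 2.7. The pressure $p(U)$ of the Gibbs measure $\mathbb{P}$
associated with the interaction $U$ is defined as

$$p(U) = \lim_{n \to \infty} \frac{1}{n} \log Z_n, \tag{2.15}$$

where

$$Z_n = \sum_{A_n \in \Omega_{C_n}} e^{-H(\mathcal{C}(A_n))}$$

is the partition function with free boundary conditions.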
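
For the nearest-neighbour Ising interaction of Section 2.3.1 the pressure can be written in closed form via the standard transfer-matrix argument, which is not taken from the paper. The sketch below assumes the sign convention used here (weights $e^{-H}$, so $J<0$ is ferromagnetic) and compares $\frac{1}{n}\log Z_n$, computed by enumeration with free boundary conditions, with the logarithm of the largest eigenvalue of the $2\times 2$ matrix $T(s,s') = \exp(-Jss' - h(s+s')/2)$. The function names are hypothetical.

```python
import math
from itertools import product


def free_hamiltonian(spins, J, h):
    """H_{C_n}(sigma) = J * sum_i s_i s_{i+1} + h * sum_i s_i (free boundary conditions)."""
    pair = sum(J * a * b for a, b in zip(spins, spins[1:]))
    field = sum(h * s for s in spins)
    return pair + field


def partition_function(n, J, h):
    """Z_n by brute-force enumeration of the 2^n patterns (small n only)."""
    return sum(math.exp(-free_hamiltonian(s, J, h))
               for s in product((-1, 1), repeat=n))


def ising_pressure(J, h):
    """p(U) = log of the largest eigenvalue of T(s, s') = exp(-J s s' - h (s + s') / 2)."""
    lam = (math.exp(-J) * math.cosh(h)
           + math.sqrt(math.exp(-2 * J) * math.sinh(h) ** 2 + math.exp(2 * J)))
    return math.log(lam)


J, h = -0.5, 0.3
p = ising_pressure(J, h)
for n in (4, 8, 12):
    # (1/n) log Z_n approaches p at rate O(1/n)
    print(n, math.log(partition_function(n, J, h)) / n, p)
```

Since $p(qU)$ is the same formula evaluated at $(qJ, qh)$, the same helper also gives quantities such as $p(U) - p(2U)/2$ for this model.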

Definition 2.8. The entropy $s(U)$ of the Gibbs measure $\mathbb{P}$
associated with the interaction $U$ is defined as

$$s(U) = \lim_{n \to \infty} -\frac{1}{n} \sum_{A_n \in \Omega_{C_n}} \mathbb{P}(\mathcal{C}(A_n)) \log \mathbb{P}(\mathcal{C}(A_n)). \tag{2.16}$$

In terms of the interaction $U$, we have the following basic thermodynamic
relation between pressure, entropy and the Gibbs measure $\mathbb{P}$
corresponding to $U$:

$$s(U) = p(U) + \int f_U \, d\mathbb{P}, \tag{2.17}$$

where

$$f_U(\sigma) = \sum_{A \ni 0} \frac{U(A,\sigma)}{|A|}$$

denotes the average internal energy per site.

We also have the following relation between $f_U$ and the Hamiltonian:

$$H_\Lambda^\zeta(\sigma) = \sum_{i \in \Lambda} \tau_i f_U(\sigma) + O(1), \tag{2.18}$$

where $O(1)$ is a quantity which is uniformly bounded in
$\Lambda, \sigma, \zeta$.

The function $f_U$ is what is called the potential in the dynamical systems
literature. An exponentially decaying interaction $U$ then corresponds to a
Hölder continuous potential $f_U$.

The following is a standard property of (one-dimensional) Gibbs measures with
interaction $U \in \mathcal{U}$. For the proof, see [3], page 7. See also [7],
pages 164-165, for properties of one-dimensional Gibbs measures.

Proposition 2.9. For the unique Gibbs measure $\mathbb{P}$ with interaction
$U$, there exists a constant $\gamma > 1$ such that, for any configuration
$\sigma \in \Omega$ and for any pattern $A_k$, we have

$$\gamma^{-1} e^{-kp(U)} e^{-H(\mathcal{C}(A_k))} \le \mathbb{P}(\mathcal{C}(A_k)) \le \gamma\, e^{-kp(U)} e^{-H(\mathcal{C}(A_k))}. \tag{2.19}$$

Two other well-known properties of Gibbs measures in d= 1, which will 
be used often, are listed below. 



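Proposition 2.10. For the unique Gibbs measure $\mathbb{P}$ corresponding to
the interaction $U \in \mathcal{U}$, there are constants $\rho < 1$ and
$c > 0$, such that, for all $A_k \in \Omega_{C_k}$ and for all
$\eta \in \Omega$,

$$\mathbb{P}(\sigma_{C_k} = A_k) \le \rho^k \tag{2.20}$$

and

$$c^{-1}\, \mathbb{P}(\sigma_{C_k} = A_k) \le \mathbb{P}(\sigma_{C_k} = A_k \mid \eta_{\mathbb{Z} \setminus C_k}) \le c\, \mathbb{P}(\sigma_{C_k} = A_k). \tag{2.21}$$

Proof. Inequality (2.20) follows from the finite-energy property, that is,
there exists $\delta > 0$ such that, for all $\sigma$,

$$0 < \delta \le \mathbb{P}(\sigma_i = a_i \mid \sigma_{\mathbb{Z} \setminus \{i\}}) \le 1 - \delta.$$

This in turn follows from

$$\mathbb{P}(\sigma_i = a_i \mid \zeta_{\mathbb{Z} \setminus \{i\}}) = \frac{\exp(-H_{\{i\}}^\zeta(a_i))}{\sum_{a \in \mathcal{A}} \exp(-H_{\{i\}}^\zeta(a))}$$

and

$$\sup_{\zeta,\, a_i} |H_{\{i\}}^\zeta(a_i)| < \infty$$

by the exponential decay condition (2.8).
Therefore,

$$\mathbb{P}(\sigma_{C_k} = A_k) \le \prod_{i=1}^{k} \sup_{\sigma} \mathbb{P}(\sigma_i = a_i \mid \sigma_{\mathbb{Z} \setminus \{i\}}) \le (1 - \delta)^k.$$

Inequalities (2.21) are proved in [7], Proposition 8.38 and Theorem 8.39. □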

2.5. Useful lemmas. In the proofs of our theorems we will frequently 
make use of the following results. 

Lemma 2.11. For $q > 0$, the function $q \mapsto p(qU)/q$ is nonincreasing.

Proof. From the definition of $p(U)$ and $s(U)$ and from the thermodynamic
relation (2.17), which is equivalent to $s = p - q\,\frac{dp}{dq}$, it follows
immediately that

$$\frac{d}{dq}\left( \frac{p(qU)}{q} \right) = -\frac{s(qU)}{q^2}.$$

The claim is then a consequence of the positivity of the entropy. □

In order to state the next lemma, we need the following notation which 
will be used throughout the paper. 

Definition 2.12. Let $(a_k)$ and $(b_k)$ be two sequences of positive numbers.
Then we write

$$a_k \approx b_k$$

if $\log(a_k) - \log(b_k)$ is a bounded sequence and

$$a_k \preceq b_k$$

if

$$a_k \le c_k$$

with $c_k \approx b_k$.

Note that $\approx$ and $\preceq$ "behave" as ordinary equalities and
inequalities and are "compatible" with the usual equalities and inequalities.
For example, if $a_k \preceq b_k$ and $b_k \preceq c_k$, then
$a_k \preceq c_k$; if $a_k \approx b_k$ and $b_k \le c_k$, then
$a_k \preceq c_k$, etc.
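
Lemma 2.13. Define

$$\alpha = p(U) - \frac{p(2U)}{2}. \tag{2.22}$$

We have $\alpha > 0$ and

$$\sum_{A_k \in \Omega_{C_k}} [\mathbb{P}(\sigma_{C_k} = A_k)]^2 \approx e^{-2\alpha k}, \tag{2.23}$$

while, for $s > 2$,

$$\sum_{A_k \in \Omega_{C_k}} [\mathbb{P}(\sigma_{C_k} = A_k)]^s \preceq e^{-s\alpha k}. \tag{2.24}$$

Proof. The positivity of $\alpha$ follows from Lemma 2.11. From Proposition
2.9 we obtain

$$\sum_{A_k \in \Omega_{C_k}} [\mathbb{P}(\sigma_{C_k} = A_k)]^2 \approx \sum_{A_k \in \Omega_{C_k}} e^{-2kp(U)} e^{-2H(\mathcal{C}(A_k))} \approx e^{-2k[p(U) - p(2U)/2]} = e^{-2\alpha k}.$$

For $s > 2$, we have

$$\sum_{A_k \in \Omega_{C_k}} [\mathbb{P}(\sigma_{C_k} = A_k)]^s \approx \sum_{A_k \in \Omega_{C_k}} e^{-skp(U) - sH(\mathcal{C}(A_k))} \approx e^{-sk[p(U) - p(sU)/s]} \preceq e^{-s\alpha k},$$

where in the last inequality we have used the monotonicity property of
Lemma 2.11. □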
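
A sanity check in the simplest special case, which the paper does not treat separately: an i.i.d. sequence with marginal $(p_a)_{a\in\mathcal{A}}$, viewed as the Gibbs measure of the single-site interaction $U(\{i\},\sigma) = -\log p_{\sigma(i)}$ (an assumption made here purely for illustration). Then the free-boundary partition function of $qU$ is $(\sum_a p_a^q)^n$, so $p(qU) = \log\sum_a p_a^q$, $p(U) = 0$ and $\alpha = -\tfrac12\log\sum_a p_a^2$. The sketch below checks (2.23) by enumeration, where it holds with equality, and the monotonicity of Lemma 2.11 on a grid. The function names are hypothetical.

```python
import math
from itertools import product


def pressure_iid(probs, q=1.0):
    """p(qU) = log sum_a p_a^q for the single-site interaction U = -log p."""
    return math.log(sum(p ** q for p in probs))


def alpha_iid(probs):
    """alpha = p(U) - p(2U)/2, as in (2.22)."""
    return pressure_iid(probs, 1.0) - pressure_iid(probs, 2.0) / 2.0


def sum_of_squared_cylinders(probs, k):
    """Sum over all k-patterns A_k of P(A_k)^2, by direct enumeration."""
    return sum(math.prod(pattern) ** 2 for pattern in product(probs, repeat=k))


probs = (0.7, 0.2, 0.1)
a = alpha_iid(probs)
for k in (1, 3, 6):
    # the two printed numbers agree up to rounding
    print(k, sum_of_squared_cylinders(probs, k), math.exp(-2 * a * k))

# Lemma 2.11: q -> p(qU)/q should be nonincreasing on q > 0
grid = [0.5 + 0.25 * i for i in range(20)]
values = [pressure_iid(probs, q) / q for q in grid]
print(all(x >= y - 1e-12 for x, y in zip(values, values[1:])))
```

For a general Gibbs measure (2.23) holds only up to bounded factors, so the exact agreement seen here is a feature of the i.i.d. assumption.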

3. The average number of shift matches. We will focus on the quantity
$N(\sigma,n,k)$ of Definition 2.1 and we will study how the number of
shift-matchings behaves when the size of the matching, $k$, is varied as a
function of the string length, $n$. It is clear that when $k = k(n)$ is very
large (say, of the order of $n$), then there will be no matching of size $k$
with probability close to one, in the limit $n \to \infty$. On the other hand,
if $k = k(n)$ is too small, then the number of shift-matchings will be very
large with probability close to one. We want to identify a scale $k^*(n)$ such
that $N(\sigma,n,k^*(n))$ will have a nontrivial distribution. Our first
result concerns the average of $N(\sigma,n,k)$. Define

$$k^*(n) = \frac{\log n}{\alpha} \tag{3.1}$$

with $\alpha$ as in (2.22). For sequences $k'(n)$ and $k(n)$, we write
$k(n) \gg k'(n)$ if $k(n) - k'(n) \to \infty$ as $n \to \infty$.

Then we have the following result.

Theorem 3.1. Let $\{k(n)\}_{n \in \mathbb{N}}$ be a sequence of integers. Then
we have the following:

1. If $k^*(n) \gg k(n)$, then
$\lim_{n \to \infty} \mathbb{E}(N(\sigma,n,k(n))) = \infty$.

2. If $k(n) \gg k^*(n)$, then
$\lim_{n \to \infty} \mathbb{E}(N(\sigma,n,k(n))) = 0$.

3. If $k(n) - k^*(n)$ is a bounded sequence, then we have

$$0 < \liminf_{n \to \infty} \mathbb{E}(N(\sigma,n,k(n))) \le \limsup_{n \to \infty} \mathbb{E}(N(\sigma,n,k(n))) < \infty. \tag{3.2}$$

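Proof. We will assume (without loss of generality) that the sequence $k(n)$
is such that

$$\lim_{n \to \infty} \frac{k(n)}{n} = 0.$$

We may rewrite $N(\sigma,n,k)$ by summing over all possible patterns of
length $k$:

$$N(\sigma,n,k) = \sum_{i=0}^{n-k} \sum_{j=i+1}^{n-k} \sum_{A_k \in \Omega_{C_k}} \mathbf{1}\{(\tau_i\sigma)_{C_k} = (\tau_j\sigma)_{C_k} = A_k\}.$$

We split the above sum into two sums, one ($S_0$) corresponding to absence of
overlap between $(\tau_i\sigma)_{C_k}$ and $(\tau_j\sigma)_{C_k}$ (i.e., the
indices $i$ and $j$ are at least $k$ apart) and one ($S_1$) where there is
overlap:

$$S_0 = \sum_{i=0}^{n-2k} \sum_{j=i+k}^{n-k} \sum_{A_k \in \Omega_{C_k}} \mathbf{1}\{(\tau_i\sigma)_{C_k} = (\tau_j\sigma)_{C_k} = A_k\},$$

$$S_1 = \sum_{i=0}^{n-k} \sum_{j=i+1}^{i+k-1} \sum_{A_k \in \Omega_{C_k}} \mathbf{1}\{(\tau_i\sigma)_{C_k} = (\tau_j\sigma)_{C_k} = A_k\}.$$

We have of course $\mathbb{E}(N(\sigma,n,k)) = \mathbb{E}(S_0) + \mathbb{E}(S_1)$.
In order to prove the first statement of the theorem, it suffices to show that
$\mathbb{E}(S_0)$ diverges under the hypothesis $k^*(n) \gg k(n)$. Using
translation-invariance, one has

$$\begin{aligned}
\mathbb{E}(S_0) &= \sum_{l=k}^{n-k} (n-k+1-l) \sum_{A_k \in \Omega_{C_k}} \mathbb{P}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k) \\
&= \sum_{l=k}^{n-k} (n-k+1-l) \sum_{A_k \in \Omega_{C_k}} \mathbb{P}(\sigma_{C_k} = A_k)\, \mathbb{P}((\tau_l\sigma)_{C_k} = A_k \mid \sigma_{C_k} = A_k).
\end{aligned}$$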

Because of the mixing condition (2.14), we have

$$\mathbb{E}(S_0) = \sum_{l=k}^{n-k} (n-k+1-l) \sum_{A_k \in \Omega_{C_k}} [\mathbb{P}(\sigma_{C_k} = A_k)]^2 + \Delta(n,k), \tag{3.3}$$

where the error $\Delta(n,k)$ is bounded by

$$|\Delta(n,k)| \le O(1) \sum_{l=k}^{n-k} (n-k+1-l) \sum_{A_k \in \Omega_{C_k}} \mathbb{P}(\sigma_{C_k} = A_k)^2\, e^{-c(l-k)}.$$

Using the mixing property (2.14) and Lemma 2.13, the error can be bounded by

$$|\Delta(n,k)| \le O(1)\, e^{-2\alpha k} \sum_{m=0}^{n-2k} (n-2k-m+1)\, e^{-cm} \le O(1)\, n\, e^{-2\alpha k}. \tag{3.4}$$

On the other hand, applying Lemma 2.13, we have that

$$\sum_{l=k}^{n-k} (n-k+1-l) \sum_{A_k} \mathbb{P}(A_k)^2 \approx (n-2k)^2 e^{-2\alpha k}. \tag{3.5}$$

Combining together (3.3), (3.4) and (3.5), we obtain

$$(n-2k)^2 e^{-2\alpha k} \preceq \mathbb{E}(N(\sigma,n,k)), \tag{3.6}$$

which proves statement 1 of the theorem.

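To prove statement 2, we have to control $\mathbb{E}(S_1)$, which is the
contribution to $\mathbb{E}(N(\sigma,n,k))$ due to self-overlapping cylinders.
Using translation-invariance, we have

$$\mathbb{E}(S_1) = \sum_{l=1}^{k-1} (n-k+1-l) \sum_{A_k \in \Omega_{C_k}} \mathbb{P}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k).$$

We further split this in two sums, namely,
$\mathbb{E}(S_1) = \mathbb{E}(S_1') + \mathbb{E}(S_1'')$ with

$$\mathbb{E}(S_1') = \sum_{l=1}^{\lfloor k/2 \rfloor} (n-k+1-l) \sum_{A_k \in \Omega_{C_k}} \mathbb{P}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k), \tag{3.7}$$

$$\mathbb{E}(S_1'') = \sum_{l=\lfloor k/2 \rfloor + 1}^{k-1} (n-k+1-l) \sum_{A_k \in \Omega_{C_k}} \mathbb{P}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k). \tag{3.8}$$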

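Let us consider first $\mathbb{E}(S_1'')$, that is,
$\lfloor k/2 \rfloor < l < k$. In this case the overlap between $C_k$ and
$\tau_l C_k$ imposes that the sum over cylinders of length $k$ can be reduced
to a sum over cylinders of length $l$. In the notation of Section 2.1, we have
the following inequality:

$$\mathbf{1}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k) \le \mathbf{1}(\sigma_{C_{k+l}} = A_k(1,l)\, A_k(1,l)\, A_k(1,k-l)). \tag{3.9}$$

In fact, if the pattern $A_k$ is such that the set
$\{\sigma \in \Omega : \sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k\}$ is not
empty, then we have equality in (3.9). Hence,

$$\begin{aligned}
\sum_{A_k} \mathbb{P}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k)
&= \sum_{A_l} \sum_{B_{k-l}} \mathbb{P}(\sigma_{C_k} = A_lB_{k-l},\ (\tau_l\sigma)_{C_k} = A_lB_{k-l}) \\
&\le \sum_{A_l} \mathbb{P}(\sigma_{C_{k+l}} = A_l A_l A_l(1,k-l)) \\
&\preceq \sum_{A_l} \mathbb{P}(A_l)^2\, \mathbb{P}(A_l(1,k-l)),
\end{aligned} \tag{3.10}$$

where in the first inequality we used the fact that contributions with
$B_{k-l} \ne A_l(1,k-l)$ are zero. Therefore, using Proposition 2.10, we
obtain

$$\mathbb{E}(S_1'') \preceq \sum_{l=\lfloor k/2 \rfloor + 1}^{k-1} (n-k+1-l) \sum_{A_l} \mathbb{P}(A_l)^2 \rho^{k-l}.$$

From this we deduce, thanks to Lemma 2.13,

$$\begin{aligned}
\mathbb{E}(S_1'') &\preceq (n-k) \sum_{l=\lfloor k/2 \rfloor + 1}^{k-1} e^{-2\alpha l} \rho^{k-l} \\
&\le (n-k)\, e^{-k\alpha} \sum_{l=\lfloor k/2 \rfloor + 1}^{k-1} \rho^{k-l} \\
&\le (n-k)\, e^{-k\alpha} \sum_{x=0}^{\infty} \rho^x \\
&\preceq (n-k)\, e^{-k\alpha}.
\end{aligned} \tag{3.11}$$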

We now treat $\mathbb{E}(S_1')$, that is, the case with
$1 \le l \le \lfloor k/2 \rfloor$. Write $k = rl + q$ with $r$ and $q$
integers, $r \ge 2$, $0 \le q \le l-1$. If the set
$\{\sigma : \sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k\}$ is not empty, then
the pattern of length $k+l$ seen in $C_{k+l}$ has to consist of $r+1$
repetitions of the subpattern $A_k(1,l)$ followed by a subpattern $A_k(1,q)$,
where $q$ is such that $(r+1)l + q = k + l$. Hence,

$$\mathbf{1}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k) \le \mathbf{1}\bigl(\sigma_{C_{k+l}} = \underbrace{A_k(1,l) \cdots A_k(1,l)}_{r+1 \text{ times}}\, A_k(1,q)\bigr). \tag{3.12}$$

At this stage one could try to repeat the same approach as in the previous
estimate for $\mathbb{E}(S_1'')$ by immediately employing Proposition 2.10.
However, this approach would not work because the repeating blocks are too
small. To circumvent this, we observe that in the pattern
$[A_k(1,l)]^{r+1} A_k(1,q)$ there exists a piece of length
$\lfloor k/2 \rfloor$ which occurs at least two times, and the remaining $l$
symbols are fixed by that piece. Therefore, using Proposition 2.10,

$$\sum_{A_k} \mathbb{P}(\sigma_{C_k} = (\tau_l\sigma)_{C_k} = A_k) \preceq \sum_{B_{\lfloor k/2 \rfloor}} \mathbb{P}(B_{\lfloor k/2 \rfloor})^2 \rho^l. \tag{3.13}$$

By inserting (3.13) in (3.7) and using Lemma 2.13, we finally have

$$\mathbb{E}(S_1') \preceq (n-k)\, e^{-k\alpha}. \tag{3.14}$$

Combining together the estimates (3.5), (3.11) and (3.14), we obtain so far

$$\mathbb{E}(N(\sigma,n,k)) \preceq (n-k)\, e^{-k\alpha} + (n-2k)^2 e^{-2k\alpha}, \tag{3.15}$$

from which statement 2 of the theorem follows.

Finally, combining (3.6) and (3.15) gives statement 3 of the theorem. □
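
To get a feeling for the three regimes of Theorem 3.1, one can tabulate the leading-order expression $n^2 e^{-2\alpha k}$ that drives the first moment in (3.6) and (3.15). The sketch below is only illustrative: the helper names are hypothetical, $k$ is treated as a real number rather than an integer, and the value $\alpha = \tfrac12\log 2$ is an arbitrary choice (it is the value one obtains for fair coin flips from (2.23), since then the sum of squared cylinder probabilities is $2^{-k}$).

```python
import math


def k_star(n, alpha):
    """Critical scale k*(n) = log(n) / alpha from (3.1)."""
    return math.log(n) / alpha


def first_moment_scale(n, k, alpha):
    """Leading-order size n^2 * exp(-2 * alpha * k) of E N(sigma, n, k)."""
    return n ** 2 * math.exp(-2.0 * alpha * k)


alpha = 0.5 * math.log(2.0)  # illustrative value (fair coin flips)
for n in (2 ** 10, 2 ** 20, 2 ** 40):
    ks = k_star(n, alpha)
    shift = math.log(math.log(n))  # a slowly diverging gap
    below = first_moment_scale(n, ks - shift, alpha)  # k* - k -> infinity: diverges
    at = first_moment_scale(n, ks + 1.0, alpha)       # k - k* bounded: stays of order one
    above = first_moment_scale(n, ks + shift, alpha)  # k - k* -> infinity: vanishes
    print(n, round(ks, 2), below, at, above)
```

Because the gap only needs to diverge, the transition window around $k^*(n)$ has bounded width, which is why statement 3 is phrased in terms of bounded $k(n) - k^*(n)$.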

4. Second moment estimate. In this section we will show that the random
variable $N(\sigma,n,k(n))$ converges in probability to $+\infty$ in the
regime where $k(n) \ll k^*(n)$, while it converges to $0$ in the opposite
regime $k(n) \gg k^*(n)$. Finally, if the difference $k(n) - k^*(n)$ is
bounded, then we show that $N(\sigma,n,k(n))$ is tight and does not converge
to zero in distribution. These results will follow as an application of the
method of first moment and second moment, respectively.

Theorem 4.1. Let $\{k(n)\}_{n \in \mathbb{N}}$ be a sequence of integers. For
every positive $m \in \mathbb{N}$:

1. If $k^*(n) \gg k(n)$, then
$\lim_{n \to \infty} \mathbb{P}(N(\sigma,n,k(n)) \le m) = 0$.

2. If $k(n) \gg k^*(n)$, then
$\lim_{n \to \infty} \mathbb{P}(N(\sigma,n,k(n)) \ge m) = 0$.

3. If $k(n) - k^*(n)$ is bounded, then $N(\sigma,n,k(n))$ is tight and does
not converge to zero in distribution. More precisely, there exists a constant
$C > 0$ such that

$$\limsup_{n \to \infty} \mathbb{P}(N(\sigma,n,k(n)) \ge m) \le C/m \tag{4.1}$$

and

$$\liminf_{n \to \infty} \mathbb{P}(N(\sigma,n,k(n)) > 0) > 0. \tag{4.2}$$

Proof. We will assume, once more, without loss of generality that

$$\lim_{n \to \infty} \frac{k(n)}{n} = 0.$$

Statement 2 and (4.1) follow from Theorem 3.1 and the Markov inequality.
To prove statement 1 and (4.2), we use the Paley-Zygmund inequality [10]
(which is an easy consequence of the Cauchy-Schwarz inequality), which gives
that for all $0 < a < 1$

$$\mathbb{P}(N \ge a\,\mathbb{E}(N)) \ge (1-a)^2\, \frac{(\mathbb{E}(N))^2}{\mathbb{E}(N^2)}. \tag{4.3}$$
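
For completeness, here is the short Cauchy-Schwarz argument behind (4.3), written for a generic nonnegative random variable $N$ with $0 < \mathbb{E}(N^2) < \infty$. It is the standard textbook derivation rather than a step quoted from the paper.

```latex
\begin{align*}
\mathbb{E}(N)
  &= \mathbb{E}\bigl(N\,\mathbf{1}\{N < a\,\mathbb{E}(N)\}\bigr)
   + \mathbb{E}\bigl(N\,\mathbf{1}\{N \ge a\,\mathbb{E}(N)\}\bigr) \\
  &\le a\,\mathbb{E}(N)
   + \sqrt{\mathbb{E}(N^2)}\,\sqrt{\mathbb{P}\bigl(N \ge a\,\mathbb{E}(N)\bigr)} .
\end{align*}
% Rearranging gives (1-a) E(N) <= sqrt( E(N^2) P(N >= a E(N)) );
% squaring and dividing by E(N^2) yields (4.3).
```

The first term is bounded using $N < a\,\mathbb{E}(N)$ on the event in question, the second by Cauchy-Schwarz applied to $N$ and the indicator.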

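We fix now a sequence $k_n \uparrow \infty$ such that $k^*(n) \gg k_n$.
Consider the auxiliary random variable

$$\mathcal{N}_n := \sum_{i,j=0,\ |i-j| > 2k_n}^{n-k_n} \mathbf{1}\{(\tau_i\sigma)_{C_{k_n}} = (\tau_j\sigma)_{C_{k_n}}\}. \tag{4.4}$$

Clearly, to obtain statement 1, it is sufficient that $\mathcal{N}_n$ goes to
infinity with probability one. On the other hand, using the first moment
computations of the previous section, we have

$$\mathbb{E}(\mathcal{N}_n) \approx n^2 e^{-2\alpha k_n}. \tag{4.5}$$

So, in order to use the Paley-Zygmund inequality, it is sufficient to show
that

$$\mathbb{E}(\mathcal{N}_n^2) \preceq \xi_n^4, \tag{4.6}$$

where we introduced the notation

$$\xi_n := n\, e^{-\alpha k_n}. \tag{4.7}$$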

Note that $\xi_n \to \infty$ for our choice of $k_n$ (as in statement 1).

Indeed, if we have (4.6) in the regime $k^*(n) \gg k(n)$, then the ratio

$$\frac{\mathbb{E}(\mathcal{N}_n^2)}{(\mathbb{E}(\mathcal{N}_n))^2}$$

remains bounded from above as $n \to \infty$, and hence, using (4.3),
$\mathcal{N}_n$ diverges with probability at least $\delta > 0$. Therefore, in
that case, by ergodicity, $N(\sigma,n,k_n) \ge \mathcal{N}_n$ goes to infinity
with probability one, since the set of $\sigma$'s such that
$N(\sigma,n,k_n)$ goes to infinity is translation-invariant, and hence has
measure zero or one.

To see how statement (4.2) follows from (4.6) in the regime where
$k(n) - k^*(n)$ is bounded, use the (more classical) second moment inequality

$$\mathbb{P}(\mathcal{N}_n > 0) \ge \frac{(\mathbb{E}(\mathcal{N}_n))^2}{\mathbb{E}(\mathcal{N}_n^2)}$$

combined with

$$N(\sigma,n,k(n)) \ge \mathcal{N}_n.$$

We now proceed with the proof of (4.6). We have

$$\mathbb{E}(\mathcal{N}_n^2) = \sum_{i,j,r,s:\ |i-j| > 2k_n,\ |r-s| > 2k_n}\ \sum_{A_{k_n}, B_{k_n}} \mathbb{P}\bigl((A_{k_n})_i (A_{k_n})_j (B_{k_n})_r (B_{k_n})_s\bigr), \tag{4.8}$$

where we use the abbreviated notation $(A_{k_n})_i$ for the event
$(\tau_i\sigma)_{C_{k_n}} = A_{k_n}$. Similarly, if we have a word of length
$l$, say, consisting of $p$ symbols of $A_p$ followed by $l-p$ symbols of
$B_{l-p}$, we write $(A_pB_{l-p})_i$ for the event that this word appears at
location $i$, that is, the event $(\tau_i\sigma)_{C_l} = A_pB_{l-p}$.

The sum in the right-hand side of (4.8) will be split into different sums,
according to the amount of overlap in the set of indices $\{i,j,r,s\}$. By
this we mean the following: we say that there is overlap between two indices
$i,j$ if $|i-j| < k_n$. The number of overlaps of a set of indices
$\{i,j,r,s\}$ is denoted by $\theta(i,j,r,s)$ and is the number of unordered
pairs of indices which have overlap. Since we restrict in the sum (4.8) to
$|i-j| > 2k_n$, $|r-s| > 2k_n$, it follows from the triangle inequality that
in that case $\theta(i,j,r,s) \le 2$. Therefore, we split the sum into three
cases

$$\sum_{i,j,r,s:\ |i-j| > 2k_n,\ |r-s| > 2k_n}\ \sum_{A_{k_n}, B_{k_n}} \mathbb{P}\bigl((A_{k_n})_i (A_{k_n})_j (B_{k_n})_r (B_{k_n})_s\bigr) = S_0 + S_1 + S_2, \tag{4.9}$$

where

$$S_p = \sum_{(i,j,r,s) \in K_{k_n,p}}\ \sum_{A,B} \mathbb{P}\bigl((A_{k_n})_i (A_{k_n})_j (B_{k_n})_r (B_{k_n})_s\bigr), \tag{4.10}$$

where we abbreviated

$$K_{k_n,p} = \{(i,j,r,s) : |i-j| > 2k_n,\ |r-s| > 2k_n,\ \theta(i,j,r,s) = p\} \tag{4.11}$$

to be the set of indices such that the overlap is $p$.

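1. Zero overlap: $S_0$.

We use Lemma 2.13 and the notation (4.7):

$$S_0 \preceq n^4 \sum_{A_{k_n}} \sum_{B_{k_n}} \mathbb{P}(A_{k_n})^2\, \mathbb{P}(B_{k_n})^2 \preceq \xi_n^4. \tag{4.12}$$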

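2. One overlap: $S_1$.

We treat the case $|i-r| < k_n$, $i < r < j < s$. The other cases are treated
in exactly the same way. Put $A_{k_n} = [a_1, a_2, \ldots, a_{k_n}]$,
$B_{k_n} = [b_1, b_2, \ldots, b_{k_n}]$. The intersection
$(A_{k_n})_i \cap (B_{k_n})_r$ is nonempty if and only if
$a_r = b_1, a_{r+1} = b_2, \ldots, a_{k_n} = b_{k_n-r+1}$, that is, the last
$k_n - r + 1$ symbols of $A_{k_n}$ are equal to the first $k_n - r + 1$
symbols of $B_{k_n}$ (with $r$ read as the position of the first symbol of
$B_{k_n}$ inside $A_{k_n}$).

Therefore, we obtain that the sum over the patterns $A_{k_n}, B_{k_n}$ in
$S_1$ equals

$$\begin{aligned}
&\sum_{A_{k_n}, B_{k_n}} \mathbb{P}\bigl((A_{k_n})_i (A_{k_n})_j (B_{k_n})_r (B_{k_n})_s\bigr) \\
&\quad = \sum_{A_{k_n}, B_{k_n}} \mathbb{P}\bigl((A_{k_n} B_{k_n}(k_n - r, k_n))_i\, (A_{k_n})_j\, (A_{k_n}(r,k_n) B_{k_n}(k_n - r, k_n))_s\bigr) \\
&\quad \preceq \sum_{A_{k_n}, B_{k_n}} \mathbb{P}(A_{k_n}(r,k_n))^3\, \mathbb{P}(A_{k_n}(1,r-1))^2\, \mathbb{P}(B_{k_n}(k_n - r, k_n))^2 \\
&\quad \preceq e^{-3(k_n - r)\alpha}\, e^{-2r\alpha}\, e^{-2r\alpha}.
\end{aligned} \tag{4.13}$$

Summing over the indices $(i,j,r,s) \in K_{k_n,1}$ then gives

$$S_1 \preceq n^3 e^{-3\alpha k_n} \sum_{r < k_n} e^{-\alpha r} \preceq \xi_n^3. \tag{4.14}$$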

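3. Two overlaps: $S_2$.

We treat the case $i < r < j < s$ and $r - i < k_n$, $s - j < k_n$. Other
cases are treated in the same way. Put $l_1 := i + k_n - r + 1$,
$p_1 := j + k_n - s + 1$. We suppose $l_1 \ge p_1$. Then the last $l_1$
symbols of $A_{k_n}$ have to equal the first $l_1$ symbols of $B_{k_n}$,
otherwise the intersection
$(A_{k_n})_i (A_{k_n})_j (B_{k_n})_r (B_{k_n})_s$ is empty. Therefore, we
obtain that the sum over the patterns $A_{k_n}, B_{k_n}$ in $S_2$ equals

$$\begin{aligned}
&\sum_{A_{k_n}, B_{k_n}} \mathbb{P}\bigl((A_{k_n})_i (A_{k_n})_j (B_{k_n})_r (B_{k_n})_s\bigr) \\
&\quad = \sum \mathbb{P}\bigl((A_{k_n} B_{k_n - l_1})_i\, (A_{k_n} B_{k_n - p_1})_j\bigr) \\
&\quad \preceq \sum \mathbb{P}(A_{k_n})^2\, \mathbb{P}(B_{k_n - l_1})^2\, \rho^{l_1 - p_1} \\
&\quad \preceq e^{-2k_n\alpha}\, e^{-2(k_n - l_1)\alpha}\, \rho^{l_1 - p_1}.
\end{aligned} \tag{4.15}$$

Summing over the indices in $K_{k_n,2}$ then gives

$$S_2 \preceq n^2 e^{-2\alpha k_n} \sum_{l_1 \le k_n} e^{-2\alpha(k_n - l_1)} \sum_{p_1 \le l_1} \rho^{l_1 - p_1} \preceq \xi_n^2. \tag{4.16}$$

Using the bounds (4.12), (4.14) and (4.16) in (4.8) and (4.9), we deduce
(4.6) and then, as explained above, statement 1 of the theorem follows from
the Paley-Zygmund inequality. This completes the proof. □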

The following result relates Theorem 4.1 and the behavior of the maximal 
shift-matching, and is the analogue of Theorem 1 in [6] (which is, however, 
convergence almost surely for more general comparison of sequences based 
on scores, but for independent sequences). 
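
Proposition 4.2. Let $M(\sigma,n)$ be defined as in Definition 2.2. Recall

$$\alpha = p(U) - \frac{p(2U)}{2}.$$

Then we have that

$$\frac{M(\sigma,n)}{\log n} \to \frac{1}{\alpha},$$

where the convergence is in probability.

Proof. Use the relations of Proposition 2.4. For $\varepsilon > 0$ we have

$$\mathbb{P}\left( \frac{\alpha\, M(\sigma,n)}{\log n} \ge 1 + \varepsilon \right) \le \mathbb{P}\left( N\bigl(\sigma, n, \lfloor (1+\varepsilon) \log n / \alpha \rfloor\bigr) \ge 1 \right)$$

and

$$\mathbb{P}\left( \frac{\alpha\, M(\sigma,n)}{\log n} < 1 - \varepsilon \right) \le \mathbb{P}\left( N\bigl(\sigma, n, \lceil (1-\varepsilon) \log n / \alpha \rceil\bigr) = 0 \right).$$

So the result follows from Theorem 4.1, since $\log n/\alpha = k^*(n)$ by (3.1). □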
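
A quick numerical experiment illustrates the scale $\log n/\alpha$ for $M(\sigma,n)$. It is a Monte Carlo sketch with hypothetical helper names, not a computation from the paper, and it assumes i.i.d. uniform symbols over an alphabet of size $q$, for which (2.23) gives $\alpha = \tfrac12\log q$ because the sum of squared cylinder probabilities equals $q^{-k}$. Since the normalization is only logarithmic in $n$, the agreement at moderate $n$ is rough.

```python
import math
import random


def has_shift_match(seq, k):
    """True if some length-k window of seq occurs at two different offsets."""
    seen = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            return True
        seen.add(window)
    return False


def max_shift_match(seq):
    """M(sigma, n) by binary search; valid because a match of size k+1 contains one of size k."""
    lo, hi = 0, len(seq) - 1  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_shift_match(seq, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


rng = random.Random(0)     # fixed seed for reproducibility
alphabet = "01"
alpha = 0.5 * math.log(len(alphabet))  # i.i.d. uniform assumption
for n in (500, 2000, 8000):
    seq = "".join(rng.choice(alphabet) for _ in range(n))
    m = max_shift_match(seq)
    # third column: alpha * M / log n, expected to be of order one
    print(n, m, round(m * alpha / math.log(n), 3))
```

The binary search replaces a linear scan over $k$ and relies on the same monotonicity in $k$ that gives the duality (2.5).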

5. Two independent strings. In this section we study the number of matches
with shift when two independent sequences $\sigma$ and $\eta$ are considered.
The marginal distributions of $\sigma$ and $\eta$ are denoted by $\mathbb{P}$
and $\mathbb{Q}$, which are chosen to be Gibbs measures with exponentially
decaying translation-invariant interactions $U(X,\sigma)$ and $V(X,\eta)$,
respectively. We assume the two strings are over the same alphabet
$\mathcal{A}$. In analogy with the case of one string, we give the following
definition.

Definition 5.1 (Number of shift-matches for 2 strings). For every couple of
configurations $(\sigma,\eta) \in \Omega \times \Omega$ and for every
$n \in \mathbb{N}$, $k \in \mathbb{N}$, with $k < n$, we define the number of
matches with shift of length $k$ as

$$N(\sigma,\eta,n,k) = \sum_{i=0}^{n-k} \sum_{j=0,\, j \ne i}^{n-k} \mathbf{1}\{(\tau_i\sigma)_{C_k} = (\tau_j\eta)_{C_k}\}. \tag{5.1}$$

Of course, in the case $\sigma = \eta$ we recover (up to a factor 2) the
previous Definition 2.1, that is, $N(\sigma,\sigma,n,k) = 2N(\sigma,n,k)$.

5.1. Identical marginal distribution. We treat here the case
$\mathbb{Q} = \mathbb{P}$, that is, the two sequences $\sigma$ and $\eta$ are
chosen independently from the same Gibbs distribution $\mathbb{P}$ with
interaction $U(X,\sigma)$. Then the results of the previous sections
generalize as follows.



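Theorem 5.2. Let $\{k(n)\}_{n \in \mathbb{N}}$ be a sequence of integers:

1. If $k^*(n) \gg k(n)$, then
$\lim_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{P}}[N(\sigma,\eta,n,k(n))] = \infty$.

2. If $k^*(n) \ll k(n)$, then
$\lim_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{P}}[N(\sigma,\eta,n,k(n))] = 0$.

3. If $k(n) - k^*(n)$ is a bounded sequence, then we have

$$0 < \liminf_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{P}}(N(\sigma,\eta,n,k(n))) \le \limsup_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{P}}(N(\sigma,\eta,n,k(n))) < \infty. \tag{5.2}$$

Proof. Because of independence, we immediately have

$$\mathbb{E}_{\mathbb{P} \otimes \mathbb{P}}[N(\sigma,\eta,n,k)] \approx (n-k)^2 \sum_{A_k \in \Omega_{C_k}} \mathbb{P}(A_k)^2 \approx (n-k)^2 e^{-2k\alpha}. \tag{5.3}$$

□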

Theorem 5.3. Let $\{k(n)\}_{n \in \mathbb{N}}$ be a sequence of integers. For
every positive $m \in \mathbb{N}$:

1. If $k^*(n) \gg k(n)$, then
$\lim_{n \to \infty} \mathbb{P} \otimes \mathbb{P}[N(\sigma,\eta,n,k(n)) \le m] = 0$.

2. If $k^*(n) \ll k(n)$, then
$\lim_{n \to \infty} \mathbb{P} \otimes \mathbb{P}[N(\sigma,\eta,n,k(n)) \ge m] = 0$.

3. If $k(n) - k^*(n)$ is bounded, then $N(\sigma,\eta,n,k(n))$ is tight and
does not converge to zero in distribution. More precisely, there exists a
constant $C > 0$ such that

$$\limsup_{n \to \infty} \mathbb{P} \otimes \mathbb{P}(N(\sigma,\eta,n,k(n)) \ge m) \le C/m \tag{5.4}$$

and

$$\liminf_{n \to \infty} \mathbb{P} \otimes \mathbb{P}(N(\sigma,\eta,n,k(n)) > 0) > 0. \tag{5.5}$$

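Proof. The strategy of the proof is as in Theorem 4.1. Thus, we need to
control the second moment to show that
$\mathbb{E}(N^2) \approx (\mathbb{E}(N))^2$. We start from

$$\begin{aligned}
\mathbb{E}_{\mathbb{P} \otimes \mathbb{P}}(N^2(\sigma,\eta,n,k))
= \sum_{i_1,i_2,j_1,j_2=0}^{n-k}\ \sum_{A_k, B_k}\
&\mathbb{P}\bigl((\tau_{i_1}\sigma)_{C_k} = A_k,\ (\tau_{i_2}\sigma)_{C_k} = B_k\bigr) \\
\times\ &\mathbb{P}\bigl((\tau_{j_1}\eta)_{C_k} = A_k,\ (\tau_{j_2}\eta)_{C_k} = B_k\bigr).
\end{aligned} \tag{5.6}$$

Using translation-invariance and defining new indices $l_1 = i_2 - i_1$ and
$l_2 = j_2 - j_1$, we have

$$\begin{aligned}
\mathbb{E}_{\mathbb{P} \otimes \mathbb{P}}(N^2(\sigma,\eta,n,k))
\approx \sum_{A_k, B_k} \Biggl(
&\sum_{l_1=1}^{n-k} (n-k+1-l_1)\, \mathbb{P}\bigl(\sigma_{C_k} = A_k,\ (\tau_{l_1}\sigma)_{C_k} = B_k\bigr) \\
\times\ &\sum_{l_2=1}^{n-k} (n-k+1-l_2)\, \mathbb{P}\bigl(\eta_{C_k} = A_k,\ (\tau_{l_2}\eta)_{C_k} = B_k\bigr) \Biggr).
\end{aligned}$$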
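We have to distinguish three kinds of contributions in the previous sums:

1. Zero overlap, that is, $l_1 > k$, $l_2 > k$. Then

$$\begin{aligned}
\sum_{A_k, B_k} \Biggl(
&\sum_{l_1=k+1}^{n-k} (n-k+1-l_1)\, \mathbb{P}\bigl(\sigma_{C_k} = A_k,\ (\tau_{l_1}\sigma)_{C_k} = B_k\bigr) \\
\times\ &\sum_{l_2=k+1}^{n-k} (n-k+1-l_2)\, \mathbb{P}\bigl(\eta_{C_k} = A_k,\ (\tau_{l_2}\eta)_{C_k} = B_k\bigr) \Biggr) \\
&\preceq (n-k)^4 \sum_{A_k, B_k} \mathbb{P}(A_k)^2\, \mathbb{P}(B_k)^2.
\end{aligned} \tag{5.7}$$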

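2. One overlap. We treat the case $l_1 \le k$ and $l_2 > k$ (other cases are
treated similarly). We have

$$\begin{aligned}
\sum_{A_k, B_k} \Biggl(
&\sum_{l_1=1}^{k} (n-k+1-l_1)\, \mathbb{P}\bigl(\sigma_{C_k} = A_k,\ (\tau_{l_1}\sigma)_{C_k} = B_k\bigr) \\
\times\ &\sum_{l_2=k+1}^{n-k} (n-k+1-l_2)\, \mathbb{P}\bigl(\eta_{C_k} = A_k,\ (\tau_{l_2}\eta)_{C_k} = B_k\bigr) \Biggr) \\
&\preceq (n-k)^3 \sum_{l_1=1}^{k}\ \sum_{D_{l_1}, E_{k-l_1}, F_{l_1}} \mathbb{P}(D_{l_1} E_{k-l_1} F_{l_1})\, \mathbb{P}(D_{l_1} E_{k-l_1})\, \mathbb{P}(E_{k-l_1} F_{l_1}) \\
&\preceq (n-k)^3 \sum_{l_1=1}^{k}\ \sum_{D_{l_1}, E_{k-l_1}, F_{l_1}} \mathbb{P}(D_{l_1})^2\, \mathbb{P}(E_{k-l_1})^3\, \mathbb{P}(F_{l_1})^2 \\
&\preceq (n-k)^3 \sum_{l_1=1}^{k} e^{-2l_1\alpha}\, e^{-2l_1\alpha}\, e^{-3(k-l_1)\alpha} \\
&\preceq (n-k)^3 e^{-3k\alpha}.
\end{aligned} \tag{5.8}$$
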
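3. Two overlaps. We treat the case $l_1 < l_2 \le k$ (other cases are treated
similarly). We have

$$\begin{aligned}
\sum_{A_k, B_k} \Biggl(
&\sum_{l_1=1}^{k} (n-k+1-l_1)\, \mathbb{P}\bigl(\sigma_{C_k} = A_k,\ (\tau_{l_1}\sigma)_{C_k} = B_k\bigr) \\
\times\ &\sum_{l_2=1}^{k} (n-k+1-l_2)\, \mathbb{P}\bigl(\eta_{C_k} = A_k,\ (\tau_{l_2}\eta)_{C_k} = B_k\bigr) \Biggr) \\
&\preceq (n-k)^2 \sum_{l_1,l_2=1}^{k}\ \sum_{D_{l_1}, E_{l_2-l_1}, F_{k-l_2}, G_{l_1}, H_{l_2-l_1}} \mathbb{P}(D_{l_1} E_{l_2-l_1} F_{k-l_2} G_{l_1}) \\
&\qquad\qquad \times \mathbb{P}(D_{l_1} E_{l_2-l_1} F_{k-l_2} G_{l_1} H_{l_2-l_1}) \\
&\preceq (n-k)^2 \sum_{l_1,l_2=1}^{k} e^{-2l_1\alpha}\, e^{-2(l_2-l_1)\alpha}\, e^{-2(k-l_2)\alpha}\, e^{-2l_1\alpha} \\
&\preceq (n-k)^2 e^{-2k\alpha}.
\end{aligned} \tag{5.9}$$

Combining together (5.7), (5.8) and (5.9) and similar expressions for the
other cases with one and two overlaps, we obtain the second moment condition
$\mathbb{E}(N^2) \preceq (\mathbb{E}(N))^2$. □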

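5.2. Different marginal distributions. In the case
$\mathbb{P} \ne \mathbb{Q}$, the first moment is controlled in an analogous
way, but the second moment analysis is different, and, in fact, as we will
show in an example, it can happen for some scale $k_n \to \infty$ that:

1. $\mathbb{E}_{\mathbb{P} \otimes \mathbb{Q}}(N(\sigma,\eta,n,k_n)) \to \infty$
as $n \to \infty$,

2. $\mathbb{P} \otimes \mathbb{Q}(N(\sigma,\eta,n,k_n) = 0) > \delta$ for some
$\delta > 0$ independent of $n$.

This means that in order to decide whether $N(\sigma,\eta,n,k_n)$ goes to
infinity $\mathbb{P} \otimes \mathbb{Q}$ almost surely, it is not sufficient
to have
$\mathbb{E}_{\mathbb{P} \otimes \mathbb{Q}}(N(\sigma,\eta,n,k_n)) \to \infty$.

We start with the case of $\mathbb{P}$ and $\mathbb{Q}$ Gibbs measures with
potentials $U$, $V$, respectively, and define

$$\alpha = \tfrac{1}{2}\, p(U) + \tfrac{1}{2}\, p(V) - \tfrac{1}{2}\, p(U+V) > 0 \tag{5.10}$$

and

$$k^*(n) = \frac{\log n}{\alpha}; \tag{5.11}$$
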
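then we have the following:

Theorem 5.4. Let $\{k(n)\}_{n \in \mathbb{N}}$ be a sequence of integers.

1. If $k^*(n) \gg k(n)$, then
$\lim_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{Q}}(N(\sigma,\eta,n,k(n))) = \infty$.

2. If $k^*(n) \ll k(n)$, then
$\lim_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{Q}}(N(\sigma,\eta,n,k(n))) = 0$.

3. If $k(n) - k^*(n)$ is a bounded sequence, then we have

$$0 < \liminf_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{Q}}(N(\sigma,\eta,n,k(n))) \le \limsup_{n \to \infty} \mathbb{E}_{\mathbb{P} \otimes \mathbb{Q}}(N(\sigma,\eta,n,k(n))) < \infty. \tag{5.12}$$

Proof. Start by rewriting

$$N(\sigma,\eta,n,k) = \sum_{i=0}^{n-k} \sum_{j=0,\, j \ne i}^{n-k}\ \sum_{A_k \in \Omega_{C_k}} \mathbf{1}\{(\tau_i\sigma)_{C_k} = A_k,\ (\tau_j\eta)_{C_k} = A_k\}.$$

Taking into account the independence of the measures $\mathbb{P}$ and
$\mathbb{Q}$, we obtain

$$\begin{aligned}
\mathbb{E}_{\mathbb{P} \otimes \mathbb{Q}}(N(\sigma,\eta,n,k))
&= \sum_{i=0}^{n-k} \sum_{j=0,\, j \ne i}^{n-k}\ \sum_{A_k \in \Omega_{C_k}} \mathbb{P}((\tau_i\sigma)_{C_k} = A_k)\, \mathbb{Q}((\tau_j\eta)_{C_k} = A_k) \\
&\approx (n-k)^2 e^{-k[p(U) + p(V) - p(U+V)]} \\
&= (n-k)^2 e^{-2k\alpha},
\end{aligned} \tag{5.13}$$

where in the second line we made use of translation-invariance and
Proposition 2.9. □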

In case 1 of Theorem 5.4, we will not in general be able to conclude that
$N(\sigma,\eta,n,k(n))$ goes to infinity almost surely as $n \to \infty$.
Indeed, if we compute the second moment, we find terms analogous to the case
$\mathbb{P} = \mathbb{Q}$, of which now we have to take the
$\mathbb{P} \otimes \mathbb{Q}$ expectation. In particular, the one-overlap
contribution will contain a term of the order

$$(n-k)^3 \sum_{E_k} \mathbb{P}(E_k)\, \mathbb{Q}(E_k)^2.$$

If $\mathbb{P} \ne \mathbb{Q}$, this term may however not be dominated by
$n^3 e^{-3k\alpha}$. Indeed, the inequality

$$\sum_{E_k} \mathbb{P}(E_k)\, \mathbb{Q}(E_k)^2 \le \left( \sum_{E_k} \mathbb{P}(E_k)\, \mathbb{Q}(E_k) \right)^{3/2}$$

is not valid in general. In particular, if $\mathbb{P}$ gives uniform measure
to cylinders and $\mathbb{Q}$ concentrates on one particular cylinder, then
this inequality will be violated.
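
The failure of this inequality is easy to see numerically. In the sketch below (illustrative only, with hypothetical function names) $\mathbb{P}$ is the uniform measure on the $2^k$ binary $k$-cylinders, $\mathbb{Q}$ is either equal to $\mathbb{P}$ or concentrated on the single cylinder of all zeros, and both sides are evaluated by enumeration.

```python
from itertools import product


def both_sides(P, Q):
    """Return (sum_E P(E) Q(E)^2, (sum_E P(E) Q(E))^(3/2)) for dicts over the same patterns."""
    lhs = sum(P[e] * Q[e] ** 2 for e in P)
    rhs = sum(P[e] * Q[e] for e in P) ** 1.5
    return lhs, rhs


k = 8
patterns = list(product((0, 1), repeat=k))
uniform = {e: 2.0 ** -k for e in patterns}
point_mass = {e: (1.0 if e == (0,) * k else 0.0) for e in patterns}

print(both_sides(uniform, uniform))     # Q = P: left side 2^(-2k) is below right side 2^(-3k/2)
print(both_sides(uniform, point_mass))  # Q concentrated: left side 2^(-k) exceeds 2^(-3k/2)
```

In the concentrated case the ratio of the two sides is $2^{k/2}$, so the violation grows exponentially in $k$.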

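As an example, inspired by this, we choose $\mathbb{P}$ to be a Gibbs measure
with potential $U$, and $\mathbb{Q} = \delta_a$, where $\delta_a$ denotes the
Dirac measure concentrating on the configuration $\eta(x) = a$ for all
$x \in \mathbb{Z}$ (which is strictly speaking not a Gibbs measure, but a
limit of Gibbs measures). In that case, $\mathbb{P} \otimes \mathbb{Q}$ almost
surely,

$$N(\sigma,\eta,n,k(n)) = (n-k) \sum_{i=1}^{n-k} \mathbf{1}((\tau_i\sigma)_{C_k} = [a]_k),$$

where $[a]_k$ denotes a block of $k$ successive $a$'s, and the factor $n-k$
comes from the sum over $j \ne i$ in Definition 5.1. Therefore,

$$\mathbb{P} \otimes \mathbb{Q}(N(\sigma,\eta,n,k(n)) = 0) = \mathbb{P}(\Theta_{[a]_k}(\sigma) > n-k),$$

where

$$\Theta_{[a]_k}(\sigma) = \inf\{j > 0 : \sigma_j = a,\ \sigma_{j+1} = a,\ \ldots,\ \sigma_{j+k-1} = a\}$$

is the hitting time of the pattern $[a]_k$ in the configuration $\sigma$. For
this hitting time we have the exponential law [1, 2] which gives

$$\mathbb{P}(\Theta_{[a]_k}(\sigma) > n) \ge e^{-\lambda\, \mathbb{P}([a]_k)\, n}$$

with $\lambda$ a positive constant not depending on $n$. Now we choose the
scale $k_n$ such that the first moment of $N(\sigma,\eta,n,k_n)$ diverges as
$n \to \infty$, that is, such that

$$n^2\, \mathbb{P}([a]_{k_n}) \to \infty.$$

Furthermore, we impose that

$$\mathbb{P}([a]_{k_n})\, n < \delta$$

for all $n$. In that case

$$\mathbb{P}(\Theta_{[a]_{k_n}}(\sigma) > n) \ge e^{-\lambda\, \mathbb{P}([a]_{k_n})\, n} \ge e^{-\lambda\delta},$$

which implies that $N(\sigma,\eta,n,k_n)$ does not go to infinity
$\mathbb{P} \otimes \mathbb{Q}$ almost surely.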
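
To make the choice of scale concrete, assume in addition (for illustration only, this is not required in the paper) that $\mathbb{P}$ is i.i.d. with $\mathbb{P}(\sigma(i) = a) = p$, so that $\mathbb{P}([a]_k) = p^k$. Then $k_n = \lceil \log(n/\delta)/\log(1/p) \rceil$ gives $n\,p^{k_n} \le \delta$, while $n^2 p^{k_n} > p\,\delta\,n \to \infty$. The helper name below is hypothetical.

```python
import math


def blocking_scale(n, p, delta):
    """An integer k with n * p**k <= delta, chosen as ceil(log(n/delta) / log(1/p)).

    Assumes i.i.d. symbols, so that P([a]_k) = p**k.
    """
    return math.ceil(math.log(n / delta) / math.log(1.0 / p))


p, delta = 0.5, 0.25
for n in (10 ** 3, 10 ** 6, 10 ** 9):
    k = blocking_scale(n, p, delta)
    # columns: n, k_n, n * P([a]_k) (stays below delta), n^2 * P([a]_k) (grows linearly in n)
    print(n, k, n * p ** k, n ** 2 * p ** k)
```

Both requirements of the example are thus compatible: the first-moment quantity grows like $n$ while the probability of seeing no block $[a]_{k_n}$ at all stays bounded away from zero.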

Acknowledgment. We thank the anonymous referee for helpful remarks 
and a careful reading. 